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Current Activities 

Clustering Binary Objects 

Brian T. Luke 

Binary objects are simply objects that are described by a finite-length bit 
string. They are often used in chemical databases because they can be very 
useful in searching for a particular compound. If the search string has a 
particular bit turned ON and a database compound does not, then that 
structure can be passed over. Two possible representations are Molecular 
Fingerprints and Structural Keys . 

The bit strings that characterize two objects can also be used to calculate a 
"distance." This effective distance can then be used with a clustering 
algorithm to place the objects into groups. One of the most popular methods 
for calculating the distance between two objects is to first determine their 
similarity. If a general similarity metric is denoted S, it usually has a value 
between 0.0 (the bit strings have no ON bits in common) and 1.0 (the two 

bit strings are identical). A distance can then be given by A(1-S) M , where 
M is a non-negative exponent and A is a constant. Conversely, each bit 
string can be thought of as components of an L-dimensional vector and the 
similarity as just the cosine of the angle between them (which is actually the 
case for the Ochini similarity). Therefore, a distance can be (A)arcos(S), or 
just a constant times the value of this angle. 

If the bit string has a length of L, it is possible to go down this string and 
count the number of times a bit is ON in both strings, ON in one and OFF in 
the other, or OFF in both strings. The four sums are presented in the table 
below. 
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Object j 


0 


1 


Object 
/ 


0 


B oo 


B 01 


1 


B 10 





The first subscript refers to the value of the bit for object i and the second 
for object j 9 summed over all L bits. Therefore, B 1Q is the number of times a 

bit is ON in i and OFF in j. 

Some other symbols that will be used in this section are 

Bj ( = B 10 + Bj j) is the total number of ON bits in object i. 

Bj ( = Bqj + Bj j) is the total number of ON bits in object j. 

B c ( = B j j ) is the total number of times a bit is ON in both bit strings. 

Bj ( = Bqq + Bj j) is the total number of times the two bit strings agree. 

L ( = Bqq + Bqj + Bjq + Bj j) is the length of the bit string. 

With these definintions, commonly used similarity metrics are presented in 
the following table. 
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Label 


Equation 


Simple Matching 
Sokal & Michener 


- SM = Bj/L 


Russel & Rao 


RR = B C /L 


Tanamoto Coefficient 


TC = B c /(Bj+Bj-B c ) 


Dice, Czekanowski, or Sorensen 
or Nei & Lie's genetic distance 


DCS = 2B c /(Bj+Bj) 


Rogers & Tanamoto 


RT = Bj/^L-B,) 


lllj^^ of Community 


JCC = B^CL-Boq) 


Kulczynski 


KS - (l^XBc/Bj+Bj/Bj) 


Ochini 

(cosine similarity function [1]) 


cs= Bc 


R Package, v4.0 


S03 = 2B I /(L+B I ) 

509 = 3B C /(3B C +B 10 +B 01 ) 

510 = B r /(B r +2B 1ft +2B m ) 


Jaccard's Similarity 


JS i-j = Bc^i 
JS j-i = B </ B j 


Tversky Index 


TI ab = B c /(aB 10 +bB 01 +B c ) 



The last two metrics are different from the rest. Jaccard ! s Similarity is 
different in that it is not symmetrical; the similarity of i to j is different than 
the similarity of j to i unless the two bit strings have exactly the same 
number of ON bits (BpBj). This metric is useful if you want to measure a 

"substructure similarity." If i is a substructure of j\ then every bit that is ON 
in i will also be ON in j though j will also have other bits ON. The other 
metrics will yield a similarity less than 1.0, while JS— produces a similarity 

of 1.0. Similarly, if j has less bits ON than z, JS— can be used to measure a 

substructure distance. 

The Tversky Index is unique in that it has two adjustable parameters, a and 
b. If a-1 and b=0, TIjq=JS— . If a+b=l, this expression can be rewritten as 

TI ab = B c /(aB ;+ bB j ) 

and if they are both equal to one-half, this just becomes one-half of 
Sorensen's Index, SI. This also means that for particular values of a and b, 
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the Tversky Index may have to be scaled so that identical bit strings produce 
a similarity of 1.0. 

A dissimilarity metric, D, is a measure of the difference between two bit 
strings. This metric has a value of 0.0 for identical bit strings and increases 
as they become less similar. It can therefore be scaled by any constant to 
yield a distance. A dissimilarity metric can easily be formed by taking 
D=(l-S), where S is a similarity value from the table above. Certain 
dissimilarities presented in the literature are given in the table below. They 
are proportional to either the sum or the product if the disagreement 
between the bit strings, B 1Q and B Q1 (note that Bjq+Bq^L-Bj is the 

Hamming Distance between these bit strings). 



Label 


Equation 


Euclidean Dissimilarity 


DE = y(L-B,yL 


Squared Euclidean Dissimilarity 


DSE = (L-Bj)/L 


Pattern Difference 


DP = 4B 10 B 01 /L 


Lance & Williams 
Bray-Curtis nonmetric coefficient 


DLW = (L-BjVCBi+Bj) 


Squared Euclidean 
Substructure Dissimilarity 


DSESj^CBj-BcyB^Bij/B. 
DSES^ = (3^)^=3^ 



In chemical databases, this metrics can yield a significant dissimilarity 
(distance) if a substructure of a molecule is compared to the full molecule. 
To remove this problem, Daylight Systems developed a Squared Euclidean 
Substructure Distance, DSES. If compound i is smaller than compound j 
(less bits turned ON), DSES— may yield a more accurate distance. 

Similariy, if object j has less bits turned ON than i (Bj < Bj), DSES— can be 

used. 

With any of these distance metrics, a clustering algorithm can be used to 
place these binary objects into groups. Since it is not possible to calculate an 
average position between two binary objects (unless they are identical), one 
cannot determine a centroid for a group of binary objects. To circumvent 
this, a medoid [2] can be used instead. A medoid is simply a bit string that 
minimizes the sum of the distance to all objects in the group, and can be 
used with Wards Method (described in Agglomerative Linkages ) and 
K-means clustering . 
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http://216J39J7J04/search?q=cache:q^^ 



In 1987, Peter Willett [3] studied the clustering of chemical databases and 
found that Jarvis-Patrick clustering [4] produced the best results. 

I used a similar procedure to try to cluster nucleotide sequences. Because of 
the large variation in the number of l's for different bit strings, I found that 
the following distance metric produced better results. 

One final point is that it is possible to weight each bit in the string. If b ik is 
the bit in position k (k=l,2,...L) for object z, it can be weighted by w k . The 
terms in the above expressions become 

B,= S^w k b ik 

B c = SwW k b lk b, k 

B , = w k s lk 

S* = 1 if b ik = b ik 
0 if b ik ^ b ik 

The weight for each bit can be related to the "inverse frequency" of being 
turned on. In other words, if this bit is on for all N objects to be clustered, it 
cannot be used as a descriminator. Conversely, if it is on in only a few of 
the objects, it may be an important descriminator and should be given a 
relatively large weight. One such inverse frequency weight is given by the 
expression 

w k = log(N/f k ) 

where f k is the number of objects that have this bit turned on. 

Note that the discussion above is limited to a comparison of bit strings that 
all have the same length. Clustering Character Objects describes metrics for 
comparing character strings, and character or binary strings of different 
length. 
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